Music has been ubiquitous for as long as humans have existed. Prehistoric instruments have been found that are thought to be at least 40,000 years old. Music is a pillar of human civilisation; it relates to people’s identities, feelings and thoughts. Hence, means of saving and sharing music are of invaluable importance. The oldest surviving work of notated music, the Hurrian Hymn to Nikkal, found on clay tablets, dates back to 1400 BC.
Various systems were developed around the globe for visually representing perceived music through written symbols. Modern Western notation is the predominant musical notation worldwide for most music genres.
With the rise of technology, audio recordings were introduced, first as analog signals and eventually as digital signals, providing means for sharing and safeguarding music aurally.
Music theory and musical notation have been studied for centuries, allowing humans and machines to retrieve music information from common formats. Nevertheless, music processing is a relatively young discipline compared to other subdomains of signal processing such as speech processing; while great results are achieved today in speech recognition, the task of retrieving music information from audio recordings is still far from solved.
Automatic Music Transcription (AMT) is the task of analyzing musical audio signals and producing the corresponding musical scores. This task captured researchers’ interest in the late 20th century and has become a wide research discipline, as many of the problems in this domain remain unsolved. Furthermore, strides in the domain of AMT would benefit numerous applications that facilitate creating, sharing, and learning music.
The scope of this thesis is the domain of Automatic Music Transcription and the underlying tasks. We explore the state of the art and propose an implementation for a subset of the presented methods.
I have been interested in this project for quite some time, partly because I am a violinist myself but also because of my fondness for the mathematical principles it employs. Most importantly, this project requires applying various mathematical notions as well as computer science skills, and hence serves as a demonstration of the knowledge acquired throughout the Master’s program.
The focus of this project is music information retrieval from music audio signals. In this section we study defining characteristics of musical elements, human perception of music, and basic notions of modern music theory. We also review the main characteristics of a sound wave as well as analytic tools for processing digital audio signals. Furthermore, we establish the bridge between music theory and physical properties of audio signals.
Sound is generated by vibrating objects; these vibrations cause oscillations of molecules in the medium. The varying pressure propagates through the medium as a wave; the pressure is therefore a solution of the wave equation in time and space, also known as the acoustic wave equation, \[\Delta p =\frac{1}{c^2}\frac{\partial^2 p}{ {\partial t}^2}\] where \(p\) is the acoustic pressure as a function of time and space and \(c\) is the speed of sound propagation. The wave equation can be solved analytically with the method of separation of variables, resulting in sinusoidal harmonic solutions.
In audio signal processing, we are interested in the pressure at the receptor’s position (listener or microphone), hence the pressure as a function of time. An audio signal is therefore defined as the deviation of pressure from the average pressure of the medium at the receptor’s position.
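The pressure function being harmonic, the sound signal is of the form \[\tilde{x}(t) = \sum_{h=0}^{\infty} A_h \cos(2\pi hf_0t + \varphi_h)\] where \(A_h\) and \(\varphi_h\) denote the amplitude and phase of the \(h\)-th harmonic partial and \(f_0\) is the fundamental frequency.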
In many works this formula appears in terms of the angular frequency \(\omega=2\pi f\). We also denote \(f_h = h f_0\) for \(h\geq 1\).
As harmonics are positive integer multiples of the fundamental frequency, the constant term \(h=0\) is taken out of the sum: \[\tilde{x}(t) = a_0 + \sum_{h=1}^{\infty} A_h \cos(2\pi hf_0t + \varphi_h)\] with \(a_0 = A_0\cos(\varphi_0)\).
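A truncated version of this sum is straightforward to evaluate numerically. The following is a minimal sketch rather than the implementation used in this thesis: the function name `harmonic_signal`, the sample rate, and the amplitude and phase values are illustrative assumptions.

```python
import numpy as np

def harmonic_signal(t, f0, amplitudes, phases, a0=0.0):
    """Evaluate a0 + sum_{h=1}^{H} A_h cos(2*pi*h*f0*t + phi_h) at times t."""
    t = np.asarray(t, dtype=float)
    x = np.full_like(t, a0)
    # enumerate from 1 so that h matches the harmonic index in the formula
    for h, (A_h, phi_h) in enumerate(zip(amplitudes, phases), start=1):
        x += A_h * np.cos(2 * np.pi * h * f0 * t + phi_h)
    return x

# Illustrative values (assumptions): 220 Hz fundamental with three partials.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate  # one second of time stamps
x = harmonic_signal(t, f0=220.0, amplitudes=[1.0, 0.5, 0.25], phases=[0.0, 0.0, 0.0])
```

Shifting the time axis by \(1/f_0\) leaves the evaluated signal unchanged up to rounding error, which is a quick numerical check of its periodicity.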
The human auditory system is capable of distinguishing intensities and frequencies of sound waves as well as temporal features. The inner ear is extremely sensitive to sound wave features, and the brain allows further analysis of these features.
Music theory defines and studies perceived features of music signals. These features are based on the signal’s intensity, frequency, and time patterns.
In music theory, a note is a musical symbol that represents the smallest musical object. The note’s attributes define the pitch of the sound, its relative duration and its relative intensity.
Sound signals are periodic; therefore, by definition, there exists a \(T>0\) such that \[\forall t, \tilde{x}(t)=\tilde{x}(t+T)\] It follows that there exists an infinite set of values of \(T>0\) that satisfy this property; indeed, for every integer \(n\geq 1\), \(T'=nT\) gives \(\tilde{x}(t)=\tilde{x}(t+T')\). We define the period of a signal as the smallest positive value of \(T\) for which the property holds. The fundamental frequency \(f_0\) is defined formally as the reciprocal of the period. This definition holds for any periodic signal, regardless of its form.
In the case of a sound wave, the perception of the fundamental frequency is referred to as the pitch. Pitch is defined as the tonal height of a sound. It is closely related to the fundamental frequency but remains a relative musical concept, unlike the \(f_0\) of a signal, which is an absolute mathematical value. In fact, the relation between pitch and \(f_0\) is neither bijective nor invariant.
In music theory, pitch is defined on a discrete space, unlike the continuous frequency space. Moreover, human perception of frequency is logarithmic; hence moving to the next pitch corresponds to multiplying the frequency by a fixed ratio \(r\).
Finally, the frequency of the reference pitch A4 is widely accepted today as \(440~\mathrm{Hz}\), while in the baroque era it was around \(415~\mathrm{Hz}\) and \(440~\mathrm{Hz}\) corresponded to the pitch A♯. Even today, the reference frequency varies between regions and even between orchestras.
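The text leaves the ratio \(r\) unspecified. Under twelve-tone equal temperament, which is an assumption made here for illustration, \(r = 2^{1/12}\), so that twelve steps double the frequency. The sketch below uses the hypothetical helper `pitch_frequency` to show how the choice of reference frequency shifts every pitch, including the baroque case mentioned above.

```python
SEMITONE_RATIO = 2 ** (1 / 12)  # assumed value of r (twelve-tone equal temperament)

def pitch_frequency(semitones_from_a4, a4=440.0):
    """Frequency in Hz of the pitch `semitones_from_a4` semitones above A4."""
    return a4 * SEMITONE_RATIO ** semitones_from_a4

# A sharp above a baroque A4 tuned to 415 Hz: close to the modern 440 Hz
a_sharp_baroque = pitch_frequency(1, a4=415.0)

# One octave (12 semitones) above the modern A4 doubles the frequency
a5_modern = pitch_frequency(12)
```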
Sound intensity is defined physically as the power carried by sound waves per unit area, whereas sound pressure is the local pressure deviation from the ambient atmospheric pressure caused by a sound wave. Human perception of intensity is directly sensitive to sound pressure. It is measured in terms of sound pressure level (SPL), a logarithmic measure of the sound pressure \(P\) relative to a reference pressure \(P_0\) (conventionally \(20~\mu\mathrm{Pa}\) in air, roughly the threshold of human hearing), expressed in decibels \(\mathrm{dB}\). \[\mathrm{SPL} = 20\log_{10}\left(\frac{P}{P_0}\right) \mathrm{dB}\]
Nevertheless, sensitivity to sound intensity varies across frequencies. The subjective perception of sound pressure is described by a sound’s loudness, which ranges from quiet to loud and is a function of both SPL and frequency.
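As a worked instance of the SPL formula, the sketch below assumes the conventional reference pressure of 20 µPa in air for \(P_0\); the function name `sound_pressure_level` is hypothetical.

```python
import math

P_REF = 20e-6  # assumed reference pressure P_0 in pascals (conventional value in air)

def sound_pressure_level(pressure_pa, p_ref=P_REF):
    """Sound pressure level in dB for a sound pressure given in pascals."""
    return 20 * math.log10(pressure_pa / p_ref)

spl_1pa = sound_pressure_level(1.0)   # a sound pressure of 1 Pa
spl_2pa = sound_pressure_level(2.0)   # doubling the pressure
```

Because the scale is logarithmic, doubling the sound pressure adds a constant \(20\log_{10}(2) \approx 6\) dB regardless of the starting level.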
In music theory, loudness is defined by a piece’s dynamics. Dynamics are indicators of a part’s loudness relative to other parts and/or instruments. Dynamic markings are expressed with the Italian keywords forte \(\boldsymbol{f}\) (loud) and piano \(\boldsymbol{p}\) (soft). Subtler degrees of loudness can be expressed with the modifiers mezzo and più, for example \(\boldsymbol{mp}\) for mezzo-piano (moderately soft) and \(pi\grave{u}~\boldsymbol{p}\) for più piano (softer), or with repeated letters, such as \(\boldsymbol{f}\hspace{-2pt}\boldsymbol{f}\) for fortissimo (very loud), with more letters added if needed.
Music dynamics also allow expressing gradual changes in loudness, indicated by symbols or by Italian keywords (crescendo and diminuendo).
The domain of audio signal processing deals with recorded signals, either analog (continuous-time) or digital (discrete-time). The Nyquist-Shannon sampling theorem is the fundamental bridge between continuous-time and discrete-time signals. It establishes a sufficient condition on the sample rate for a discrete sequence of samples to capture all the information from a continuous-time signal of finite bandwidth: the sample rate must exceed twice the highest frequency present in the signal. (“Nyquist–Shannon Sampling Theorem” 2020)
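The sufficient condition of the theorem is that the sample rate exceeds twice the highest frequency in the signal. The following minimal sketch, with illustrative values chosen only for this example, shows what happens when the condition is violated: a 3 Hz cosine sampled at 4 Hz produces exactly the same samples as a 1 Hz cosine, so the two can no longer be told apart.

```python
import numpy as np

sample_rate = 4.0                      # samples per second (illustrative)
n = np.arange(16)                      # sample indices
t = n / sample_rate                    # sampling instants in seconds

low = np.cos(2 * np.pi * 1.0 * t)      # 1 Hz: below sample_rate / 2
high = np.cos(2 * np.pi * 3.0 * t)     # 3 Hz: above sample_rate / 2

# True when both tones yield the same samples, i.e. the 3 Hz tone aliases to 1 Hz
aliased = bool(np.allclose(low, high))
```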
Pitch analysis is the task of estimating the fundamental frequency of a periodic signal, that is, the inverse of its period, which is defined as “the smallest positive member of the infinite set of time shifts leaving the signal invariant” (Cheveigné and Kawahara 2002). As the frequencies of a music signal vary through time, pitch analysis is usually performed on a short time frame (window), which allows the obtained pitch to be expressed as a function of time. Henceforth we consider the analysis of a single frame.
Furthermore, the model we have considered for the signal rests on idealised physical hypotheses: a signal produced by a perfectly harmonic instrument, travelling in a perfectly undisturbed homogeneous medium with no interfering waves. Since such conditions are almost never met, we base our analysis on imperfect conditions. Indeed, the recorded signal represents the pressure function at the receptor’s position. Consequently, the recorder captures the pressure at its position from all surrounding stimuli, including surrounding noise, resonance effects, and reflected waves arriving with a certain lag. As a result, we express the observed signal as the sum of the harmonic signal \(\tilde{x}\) and the residual \(z\). (Yeh, n.d.) \[x(t) = \tilde{x}(t) + z(t)\]
Before we move on, let us consider the harmonicity of a sound. In the case of a perfectly harmonic instrument, the frequency of each harmonic partial is an integer multiple of the fundamental frequency, \(f_h = h f_0\). However, most musical instruments are not perfectly harmonic; for example, the \(h^\text{th}\) harmonic frequency of a vibrating string is given by \[ f_h = h f_0 \sqrt{1 + Bh^2} \quad\text{with}\quad B = \frac{\pi^3 Ed^4}{64l^2T}\] where \(B\) is the inharmonicity factor of the string, \(E\) is Young’s modulus, \(d\) is the diameter of the string, \(l\) is its length and \(T\) is its tension. We refer to such signals as quasi-periodic. Pitch analysis therefore has to take the inharmonicity of a signal into account when estimating its fundamental frequencies, in order to prevent false negatives (missed pitches). [source needed]
Pitch analysis deals with both monophonic and polyphonic signals. A monophonic signal is produced by a single harmonic source, whereas a polyphonic signal has multiple sources; in the latter case the task is significantly harder. Nevertheless, pitch estimation methods for both single and multiple sources can be classified into two categories: methods that estimate the period in the time domain of the signal, and methods that estimate \(f_0\) from the harmonic patterns in the signal spectrum.
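To give a sense of how the inharmonicity factor stretches the partials, the sketch below evaluates the string formula above. The function names and the value \(B = 10^{-4}\) are assumptions chosen for illustration and are not measured values for any particular instrument.

```python
import numpy as np

def inharmonicity_factor(E, d, l, T):
    """B = pi^3 * E * d^4 / (64 * l^2 * T), in consistent physical units."""
    return np.pi ** 3 * E * d ** 4 / (64 * l ** 2 * T)

def partial_frequencies(f0, n_partials, B=0.0):
    """Frequencies f_h = h * f0 * sqrt(1 + B * h^2) for h = 1..n_partials."""
    h = np.arange(1, n_partials + 1)
    return h * f0 * np.sqrt(1.0 + B * h ** 2)

# Perfectly harmonic partials (B = 0) versus slightly stretched ones
harmonic = partial_frequencies(100.0, 5)
stretched = partial_frequencies(100.0, 5, B=1e-4)  # illustrative B, not measured
deviation = stretched - harmonic                   # grows with the partial index h
```

The deviation from the harmonic series grows with \(h\), which is why a pitch estimator that expects exact multiples of \(f_0\) is most likely to mismatch the higher partials.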
Single pitch estimation consists in finding the fundamental frequency of a monophonic sound. The quasi-periodic monophonic signal \(\tilde{x}\) is expressed as \[\tilde{x}(t)=\sum_{h=1}^{\infty} A_h\cos(2\pi h f_0 t + \varphi_h)\] For practical reasons, a finite number of harmonic partials \(H\) is used to approximate the signal. \[\tilde{x}(t)\approx\sum_{h=1}^{H} A_h\cos(2\pi h f_0 t + \varphi_h)\]
The estimation of \(f_0\) can be approached in two different ways: by analysing the time function \(x(t)\) or by analysing the signal spectrum \(X(f)\).
Time domain methods analyse the repetitiveness of the wave by comparing the signal with a delayed version of itself. This comparison is achieved using special functions that represent the pattern similarity or dissimilarity as a function of the time lag \(\tau\).
We will study and compare the functions that appear most often in the literature.
The autocorrelation function (ACF) comes immediately to mind. Autocorrelation measures the similarity between a signal and a copy of itself delayed by a lag \(\tau\). Given a discrete signal of \(N\) samples, the autocorrelation function is defined as \[r[\tau] = \sum_{t=1}^{N-\tau} x[t]x[t+\tau]\]
The ACF attains a local maximum when the lag is equal to the signal’s period or one of its multiples. Autocorrelation is sensitive to structures in signals, making it useful in speech detection applications. However, resonance structures appear in music signals, hence the need for a better adapted function.
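The ACF definition above can be sketched directly with NumPy. This is an illustration under stated assumptions, not the thesis implementation: the test tone, the sample rate, and the `min_lag` bound used to skip the trivial maximum at lag 0 are all chosen for the example, and the code uses zero-based indices where the formula is one-based.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """r[tau] = sum_t x[t] * x[t + tau] for tau = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - tau], x[tau:]) for tau in range(max_lag + 1)])

# Illustrative test tone: 200 Hz cosine sampled at 8 kHz (period of 40 samples)
sample_rate = 8000
t = np.arange(800) / sample_rate
x = np.cos(2 * np.pi * 200.0 * t)

r = autocorrelation(x, max_lag=60)

# Ignore short lags, where the ACF is dominated by the peak at lag 0
min_lag = 20
period_lag = min_lag + int(np.argmax(r[min_lag:]))
f0_estimate = sample_rate / period_lag
```

Restricting the search to a range of plausible lags is a common practical choice, since the ACF is always largest at lag 0.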
The Average Magnitude Difference Function (AMDF) (Ross et al. 1974) is the average unsigned difference between \(x[t]\) and \(x[t+\tau]\). \[d_{\text{AM}}[\tau] = \frac{1}{N} \sum_{t=1}^{N-\tau} \left\lvert x[t]-x[t+\tau]\right\rvert\] The difference function attains local minima at lags equal to integer multiples of the signal’s period. AMDF is better suited than autocorrelation to applications in music processing.
The Squared Difference Function (SDF) is very similar to AMDF; however, it accentuates the dips at multiples of the signal’s period and therefore indicates the local extrema more clearly. \[d[\tau] = \sum_{t=1}^{N-\tau}(x[t]-x[t+\tau])^2\]
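A minimal sketch of the AMDF follows, using zero-based indices. The two-partial test signal, the sample rate, and the `min_lag` bound are assumptions made for this example only.

```python
import numpy as np

def amdf(x, max_lag):
    """d_AM[tau] = (1/N) * sum_t |x[t] - x[t + tau]| for tau = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array(
        [np.sum(np.abs(x[: n - tau] - x[tau:])) / n for tau in range(max_lag + 1)]
    )

# Illustrative signal: 200 Hz fundamental plus its second harmonic, 8 kHz sampling
sample_rate = 8000
t = np.arange(800) / sample_rate
x = np.cos(2 * np.pi * 200.0 * t) + 0.5 * np.cos(2 * np.pi * 400.0 * t + 0.3)

d_am = amdf(x, max_lag=60)

# The AMDF is zero at lag 0 by construction, so search from a minimum lag
min_lag = 20
period_lag = min_lag + int(np.argmin(d_am[min_lag:]))
```

Unlike the ACF, the period shows up here as a dip rather than a peak, so the estimate is taken at the minimum.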
The YIN algorithm (Cheveigné and Kawahara 2002) employs the SDF as an auxiliary function for calculating the cumulative mean normalized difference function, which divides the SDF by its average over shorter lags. Unlike the SDF and AMDF, it starts at 1 rather than 0; it tends to stay large at short lags and drops when the SDF falls below its average.
\[d_{\text{YIN}}[\tau] = \begin{cases} 1 &\text{if}~\tau = 0\\ d[\tau] \Big/ \left(\frac{1}{\tau}\sum\limits_{j=1}^{\tau} d[j]\right) &\text{otherwise} \end{cases}\]
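The two functions can be sketched together, since the normalized function is computed from the SDF. This is an illustration only: the function names and the test signal are assumptions, and taking the global minimum as the period is a simplification, as the full YIN algorithm includes further steps that are not reproduced here. Because \(d[0]=0\), starting the cumulative sum at lag 1 gives the same value as starting it at lag 0.

```python
import numpy as np

def squared_difference(x, max_lag):
    """d[tau] = sum_t (x[t] - x[t + tau])^2 for tau = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array(
        [np.sum((x[: n - tau] - x[tau:]) ** 2) for tau in range(max_lag + 1)]
    )

def cumulative_mean_normalized_difference(d):
    """1 at lag 0, otherwise d[tau] divided by the mean of d[1..tau]."""
    d = np.asarray(d, dtype=float)
    out = np.ones_like(d)
    taus = np.arange(1, len(d))
    running_sum = np.cumsum(d[1:])                       # sum of d[1..tau]
    # guard against division by zero for a constant (silent) frame
    running_sum = np.maximum(running_sum, np.finfo(float).tiny)
    out[1:] = d[1:] * taus / running_sum
    return out

# Illustrative signal: 200 Hz fundamental plus its second harmonic, 8 kHz sampling
sample_rate = 8000
t = np.arange(800) / sample_rate
x = np.cos(2 * np.pi * 200.0 * t) + 0.5 * np.cos(2 * np.pi * 400.0 * t + 0.3)

d = squared_difference(x, max_lag=60)
d_yin = cumulative_mean_normalized_difference(d)
period_lag = int(np.argmin(d_yin))   # simplified choice: global minimum
```

Since the normalized function equals 1 at lag 0, the minimum can be searched over all lags without excluding the short ones by hand, which is not the case for the SDF or the AMDF.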
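```python
from muallef.io import AudioLoader
from muallef.plot import diff_functions as df

# Load a recording of a single cello note (C#2)
cello = AudioLoader('samples/instrument_single/cello_csharp2.wav')
# Keep a short excerpt of the recording (from 2 to 2.06)
cello.cut(start=2, stop=2.06)
# Plot the time-domain functions; 69.3 Hz is the fundamental frequency of C#2
df.time_domain_plots(cello.signal, cello.sampleRate, pitch=69.3)
```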
The Fourier transform is the most suitable mathematical tool for analysing periodicity in functions. The transform produces a complex function of frequency, whose magnitude attains local maxima at the signal’s fundamental frequency and its harmonics.
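As a minimal sketch of this property, the example below computes the magnitude spectrum of a synthetic three-partial signal with NumPy's real FFT. The signal, its amplitudes, and the one-second duration (chosen so that the frequency bins are 1 Hz apart) are assumptions made for the example.

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate      # one second, so bins are 1 Hz apart

# Illustrative harmonic signal: 200 Hz fundamental and two harmonics
x = (1.00 * np.cos(2 * np.pi * 200.0 * t)
     + 0.50 * np.cos(2 * np.pi * 400.0 * t)
     + 0.25 * np.cos(2 * np.pi * 600.0 * t))

spectrum = np.abs(np.fft.rfft(x))                     # magnitude of the transform
freqs = np.fft.rfftfreq(len(x), d=1 / sample_rate)    # frequency of each bin in Hz

strongest = freqs[np.argmax(spectrum)]                # largest peak
top_three = np.sort(freqs[np.argsort(spectrum)[-3:]]) # three largest peaks
```

In this idealised case the three largest peaks fall on the fundamental and its first two harmonics. On real recordings the largest peak is not guaranteed to be the fundamental, which is why spectral methods look at the harmonic pattern as a whole.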